The Dual Mandate of the Census Bureau


► Collect data and disseminate accurate statistics about the 
US population. 

► Protect the privacy of individual data. 



Privacy Loss (ε)




Overview of TopDown and Trade-off 


The 2020 Disclosure Avoidance System (DAS), in
particular its core TopDown algorithm, provides a
production technology. The production activity is up to
policy makers; however, the choice is constrained by the
given technology.

Empirical results are useful for modeling the production
possibility frontier (PPF). Understanding the trade-off
between privacy loss and accuracy that is specific to the
technology at hand is necessary for informed decision
making.
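As a minimal illustration of such a trade-off curve, the sketch below estimates the mean absolute error of a single protected count at several privacy-loss values. It uses the textbook Laplace mechanism as a stand-in, not the TopDown algorithm; the function names, the sensitivity of 1, and the grid of privacy-loss values are assumptions made for the example.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def mean_abs_error(epsilon, trials=20000, sensitivity=1.0, seed=0):
    """Empirical mean absolute error of one noisy count at privacy loss epsilon."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon  # assumed sensitivity of a single count
    return sum(abs(laplace_noise(scale, rng)) for _ in range(trials)) / trials

# Trace a toy frontier: a larger privacy loss buys a smaller expected error.
frontier = [(eps, mean_abs_error(eps)) for eps in (0.25, 0.5, 1.0, 2.0, 4.0)]
for eps, err in frontier:
    print(f"epsilon={eps:<5} mean |error| ~ {err:.2f}")
```

For this mechanism the expected absolute error equals sensitivity/epsilon, so the curve can be checked against theory. For TopDown the frontier has to be measured empirically.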

Empirical results measure fitness-for-use against 
important use-cases such as redistricting. 



Disclosure Avoidance System (DAS) 

► The DAS is a small component of the entire 2020 Census 
operation. 

► The codomain of the DAS must lie in the domain of
every downstream system that consumes its output.

► Historically, this implied that the DAS must output
microdata.



Rethinking the Microdata Requirement


Historically, all the major data products were sourced
from microdata.

This requirement improperly constrains the production
technology that defines the PPF. Simple example: Laplace
Mechanism vs NoisyMax.
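The contrast can be sketched as follows, assuming k counting queries that may overlap, each with sensitivity 1. Under standard accounting, releasing all k noisy answers with the Laplace mechanism needs noise of scale k/epsilon, while report-noisy-max releases only the winning index and needs scale 1/epsilon. The counts, the value of epsilon, and all function names are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_then_argmax(counts, epsilon, rng):
    """Release all k noisy counts (scale k/epsilon), then take the argmax."""
    scale = len(counts) / epsilon
    noisy = [c + laplace_noise(scale, rng) for c in counts]
    return max(range(len(counts)), key=noisy.__getitem__)

def report_noisy_max(counts, epsilon, rng):
    """Release only the index of the largest noisy count (scale 1/epsilon)."""
    scale = 1.0 / epsilon
    noisy = [c + laplace_noise(scale, rng) for c in counts]
    return max(range(len(counts)), key=noisy.__getitem__)

def hit_rate(mechanism, counts, epsilon, trials=2000, seed=0):
    """Fraction of trials in which the mechanism returns the true argmax."""
    rng = random.Random(seed)
    truth = max(range(len(counts)), key=counts.__getitem__)
    hits = sum(mechanism(counts, epsilon, rng) == truth for _ in range(trials))
    return hits / trials

counts = [100, 90, 80, 70, 60]  # hypothetical answers to 5 overlapping queries
epsilon = 0.1
rate_laplace = hit_rate(laplace_then_argmax, counts, epsilon)
rate_noisy_max = hit_rate(report_noisy_max, counts, epsilon)
print(f"Laplace then argmax: {rate_laplace:.2f}")
print(f"Report noisy max:    {rate_noisy_max:.2f}")
```

The output of report-noisy-max is an index rather than a set of records, so a requirement that everything flow through microdata rules it out even though it answers this particular question more accurately at the same privacy loss.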

The end-goal is high utility data products. Allowing for 
greater flexibility in the algorithm design will lead to 
better production technologies. 

Don’t need to abandon microdata completely though. 
What can be produced well as microdata? 



Redesign of Data Products 


Microdata supported products (PL94, DHC-Persons and 
DHC-Households, Demo Profile). 

Other Products (Detailed Race/AIAN, Person-Household 
Joins) - Out of scope for TopDown. 

Table classification: 

► by geography. 

► by universe: what is the item being counted (person or 
household)? 



Rethinking the Microdata Detail File (MDF) 
Specifications 


► Historically, a wide variety of variables were preserved
through the DA process (date of birth, mafid, allocation
flags).

► Reverse engineer microdata schema to meet demands of 
revised data products. 

► Attributes/attribute domains should have as much detail 
as necessary but no more. 

► Microdata will consist of two disjoint files (one for each 
universe: person and household). 

► Each record universe is the Cartesian product of its 
attribute domains, after eliminating structural zeros.



MDF Person



► Naive cardinality of the Person Record Universe (excluding
block id) = 83,311,200 ~ 83 million.
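The naive cardinality is the product of the attribute-domain sizes. The sketch below shows the computation on hypothetical, heavily simplified domains; these are not the MDF Person attributes, and the variable names are invented for the example.

```python
from itertools import product
from math import prod

# Hypothetical, heavily simplified attribute domains (not the MDF specification).
domains = {
    "sex": ["male", "female"],
    "age_group": ["0-17", "18-64", "65+"],
    "hispanic": ["no", "yes"],
}

# Naive cardinality: the product of the domain sizes.
naive_cardinality = prod(len(values) for values in domains.values())

# The record universe itself is the Cartesian product of the domains.
universe = list(product(*domains.values()))

# A histogram over the universe holds one count per cell; map cells to indices.
cell_index = {cell: i for i, cell in enumerate(universe)}

print(naive_cardinality)  # 2 * 3 * 2 = 12
print(cell_index[("female", "65+", "yes")])
```

Every attribute multiplies the number of cells, so adding even a modest attribute inflates the histogram quickly.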

MDF Unit


No.  Variable  Description            Type     Values

11   VACS      Vacancy Status         CHAR(1)  0 = NIU
                                               1 = Vacant, for rent
                                               2 = Vacant, rented, not occupied
                                               3 = Vacant, for sale only
                                               4 = Vacant, sold, not occupied
                                               5 = Vacant, for seasonal,
                                                   recreational, or occasional use
                                               6 = Vacant, for migrant workers
                                               7 = Vacant, other

12   HHSIZE    Population Count       INT(5)   0 (vacant or NIU), 1, 2, 3, 4, 5, 6, 7+

13   HHT       Household/Family Type  CHAR(1)  0 = NIU
                                               1 = Married couple household
                                               2 = Other family household: Male
                                                   householder


► Naive cardinality of the Unit Record Universe (excluding
block id) = 7,188,480,000,000 ~ 7 trillion.



Refining the Record Universes


Removing structural zeros/merging variables: 

► A person cannot reside simultaneously in a household 
and GQ. 

► A 1-person household cannot be a married family 
household. 

Managing corner cases separately: 

► Vacancy status does not cross with occupied household 
attributes. 
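Removing a structural zero amounts to filtering impossible cells out of the Cartesian product. The sketch below encodes only the one-person, married-household rule named above; the domains and category labels are simplified, hypothetical stand-ins for the HHSIZE and HHT attributes.

```python
from itertools import product

# Hypothetical simplified domains; 7 stands for "7+", vacant/NIU is left out.
HHSIZE = [1, 2, 3, 4, 5, 6, 7]
HHT = ["married_couple", "other_family", "nonfamily"]

def is_structural_zero(hhsize, hht):
    """Only the rule from the slide: a 1-person household cannot be married."""
    return hhsize == 1 and hht == "married_couple"

naive = list(product(HHSIZE, HHT))
refined = [cell for cell in naive if not is_structural_zero(*cell)]

print(len(naive), len(refined))
```

Each additional rule removes more cells, which is how a universe of trillions of naive combinations can shrink to a histogram of a few million cells.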

Person and Household histograms are both under 3 
million cells. 

For comparison, the 2018 E2E code ran on a 2,000 cell 
histogram. 

Largest successful runs have been on a subset of the 
person variables, roughly 467k cells. 



Initializing the DAS 


The core TopDown algorithm will run twice (once for 
Persons and once for Households). 

► One consequence is that it is difficult to maintain
consistency across universes.

► We strive for within-universe consistency. Stakeholder
input is vital here.

Primary input is the confidential Census Edited File 
(CEF). 

The CEF is treated as ground truth. That is, our privacy
analysis does not account for operations preceding the
DAS, including edit and imputation procedures.

Assumption: Input data are clean. No missing values,
out-of-range values, etc.



2010 Test Products: Intro 


The DAS team is generating test products that 
demonstrate the computational capabilities of the DAS at 
present. 

The DAS is capable of processing the 2010 CEF and 
producing protected microdata adhering to (slightly 
simplified) 2020 MDF specifications. 

The test MDF can be used to tabulate about 70% of the 
tables in the proposed 2020 DHC data product. 



2010 Test Products: Scale 


The DAS is capable of processing the entire nation
(~310 million person records and ~120 million
household records) rather than a small test area (such as
Providence, RI in the 2018 End-to-End (E2E) test).

The “slightly simplified” 2020 MDF specification
translates to roughly 200 times the scale of the 2018 E2E
test in terms of histogram size.

The DAS can produce microdata for persons and
households. Household characteristics were essentially
non-existent in the 2018 E2E test.



2010 Test Products: Resource constraints 


► The DAS operates on Amazon Web Services Elastic
MapReduce (EMR) clusters.

► Operating at scale requires about 18 worker nodes
(r4.16xlarge: 64 cores, 488 GB RAM).

► Observed run times: ~ 20 hours to produce the 
household microdata and ~ 60 hours to produce the 
person microdata. 



2010 Test Products: Privacy-loss, Accuracy Tradeoff


Statistic: Total Population Only

[Figure: Accuracy as a function of privacy-loss budget (PLB)
for New Mexico, by geolevel. Data product: DHC-P.
Horizontal axis: Privacy Loss Budget (PLB).]








2010 Test Products: Privacy-loss, Accuracy Tradeoff


Statistic: (raceAlone & 2+ races) x HISP

[Figure: Accuracy as a function of privacy-loss budget (PLB)
for New Mexico, by geolevel. Data product: DHC-P.
Horizontal axis: Privacy Loss Budget (PLB).]






2010 Test Products: Privacy-loss, Accuracy Tradeoff


Statistic: CENRACE x HISP Sub-Histogram

[Figure: Accuracy as a function of privacy-loss budget (PLB)
for New Mexico, by geolevel. Data product: DHC-P.
Horizontal axis: Privacy Loss Budget (PLB).]



Backup: Artificial Geolevels


[Figure: hierarchy of Census geographic entities.

Central spine, top to bottom: Nation, Regions, Divisions,
States, Counties, Census Tracts, Census Blocks.

Entities off the spine: ZIP Code Tabulation Areas, School
Districts, Congressional Districts, Voting Districts, Traffic
Analysis Zones, County Subdivisions, AIANNH Areas (American
Indian, Alaska Native, Native Hawaiian Areas), Urban Areas,
Core Based Statistical Areas, Urban Growth Areas, State
Legislative Districts, Public Use Microdata Areas, Subminor
Civil Divisions.]







Backup: Artificial Geolevels 


► Introduce artificial geolevels into main hierarchy to help 
with scaling. Reduces maximum number of children at a 
given level - a known bottleneck in scaling. 

► Preliminary results are promising, especially between county
and tracts, with regard to improving tractability of large
scale runs. Full impact on accuracy still being analyzed.
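The effect on the maximum number of children can be sketched with a toy calculation. The tract counts, the group size of 50, and the sequential chunking rule are all assumptions for illustration; nothing here reflects how the DAS actually forms artificial geolevels.

```python
from math import ceil

def max_fanout(children_per_parent):
    """Largest number of children under any single parent."""
    return max(children_per_parent)

def insert_artificial_level(n_children, group_size):
    """Split one parent's children into groups of at most `group_size`.

    Returns (number of artificial nodes under the parent,
             list with the number of children under each artificial node).
    """
    n_groups = ceil(n_children / group_size)
    sizes = [group_size] * (n_children // group_size)
    if n_children % group_size:
        sizes.append(n_children % group_size)
    return n_groups, sizes

tracts_per_county = [12, 45, 2300, 160]  # hypothetical tract counts per county
group_size = 50                          # hypothetical cap on group size

before = max_fanout(tracts_per_county)
after = 0
for n in tracts_per_county:
    n_groups, sizes = insert_artificial_level(n, group_size)
    # Fanout now occurs at two places: county -> groups and group -> tracts.
    after = max(after, n_groups, max(sizes))

print(before, after)
```

In this toy case the largest parent goes from 2300 children to at most 50 at either of the two new levels, at the cost of one extra level in the hierarchy.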



Backup: Detailed Race/AIAN and 
Person-Household Joins 


Why not microdata? 

► High sensitivity. 

► Complex consistency requirements. 

► Non-standard geographies. 

► TopDown scalability.

► Fundamentally different algorithms required for joins. 

► Small count bias will look even worse. 

For joins, considering PrivateSQL: produces a privatized 
view from a relational database from which tables can be 
published [KTHFMHM19]. 

For Detailed Race/AIAN, considering a variant of
TopDown that relaxes many of the consistency
requirements, and only crosses race with select other
variables.



